## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] 1599
## [1] 13
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The dataset has 1599 observations of 13 columns. The first column, X, is just a sequence number, which leaves 12 variables to analyze.
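The structure checks printed above can be reproduced with a few base R calls. The sketch below is only an illustration: it rebuilds the first ten rows of three columns from the `str()` output instead of reading the full file, so the counts it prints will not match the 1599-row results.

```r
# First ten rows of three columns, copied from the str() output above.
# The full dataset has 1599 rows and 13 columns.
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

print(names(wine))          # column names
str(wine)                   # type and first values of each column
print(nrow(wine))           # number of observations
print(ncol(wine))           # number of columns, X included
print(table(wine$quality))  # how many wines per quality score

# X only numbers the rows, so drop it before the analysis
wine$X <- NULL
```

Dropping `X` early keeps it out of later summaries and correlations, where a row counter would only add noise.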
Normal-like distributions: quality, density, pH.
Right-skewed distributions (most values on the left, with a long right tail): fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol.
The quality of a wine may be related to each of the other variables. Univariate exploration cannot clearly determine the relationship between the independent and dependent variables, so I need to dig deeper with bivariate exploration.
Not yet.
residual.sugar has a long tail, so I use a log transform to adjust the data. citric.acid has many zeros and a long tail, so I use a square-root transform, which compresses the data while avoiding the infinite values that log(0) would produce. Also, quality is not a continuous variable, so I convert it to a factor.
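A minimal sketch of these three adjustments, using the first ten rows from the `str()` output above. The choice of `log10` (the text only says "log") and the new column names `log.residual.sugar` and `sqrt.citric.acid` are my assumptions. The original analysis may have applied the transforms on the plot axes instead of storing new columns.

```r
# First ten rows, copied from the str() output above
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  citric.acid    = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# log10 pulls in the long right tail; safe here because every value is > 0
wine$log.residual.sugar <- log10(wine$residual.sugar)

# sqrt is defined at 0, whereas log10(0) is -Inf
wine$sqrt.citric.acid <- sqrt(wine$citric.acid)

# quality is a discrete score, so store it as a factor
wine$quality <- factor(wine$quality)

str(wine)
```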
## [1] "Summary of variable fixed.acidity By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 94 18.737 6.283 8.79e-06 ***
## Residuals 1593 4751 2.982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable volatile.acidity By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 8.22 1.645 60.91 <2e-16 ***
## Residuals 1593 43.01 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable citric.acid By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.53 0.7059 19.69 <2e-16 ***
## Residuals 1593 57.11 0.0359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable residual.sugar By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 10 2.094 1.053 0.385
## Residuals 1593 3166 1.988
## [1] "Summary of variable chlorides By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.066 0.013162 6.036 1.53e-05 ***
## Residuals 1593 3.474 0.002181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable free.sulfur.dioxide By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 2571 514.1 4.754 0.000257 ***
## Residuals 1593 172274 108.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable total.sulfur.dioxide By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 128045 25609 25.48 <2e-16 ***
## Residuals 1593 1601155 1005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable density By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0032
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.000230 4.594e-05 13.4 8.12e-13 ***
## Residuals 1593 0.005462 3.430e-06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable pH By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.163 3.230 3.267 3.350 3.720
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.51 0.10242 4.342 0.000628 ***
## Residuals 1593 37.58 0.02359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable sulphates By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.00 0.6000 22.27 <2e-16 ***
## Residuals 1593 42.91 0.0269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Summary of variable alcohol By quality"
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
## [1] "One-way ANOVA test"
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 483.9 96.79 115.9 <2e-16 ***
## Residuals 1593 1330.8 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
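The report does not show the code behind the per-quality summaries and ANOVA tables above. One way to produce output of that shape is sketched here. The helper name `summarise_by_quality` is hypothetical, and the data is only the first ten rows from the `str()` output, so the numbers will differ from the full-data tables.

```r
# First ten rows, copied from the str() output above
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Hypothetical helper: six-number summary per quality level,
# followed by a one-way ANOVA of the variable against quality
summarise_by_quality <- function(data, variable) {
  print(paste("Summary of variable", variable, "By quality"))
  print(by(data[[variable]], data$quality, summary))

  print("One-way ANOVA test")
  # reformulate() builds the formula <variable> ~ quality from strings
  fit <- aov(reformulate("quality", response = variable), data = data)
  print(summary(fit))

  invisible(fit)
}

fit <- summarise_by_quality(wine, "fixed.acidity")
```

Calling the helper once per column (for example in a `for` loop over the column names) would give one block of output per variable, as in the listing above.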
## [1] "Correlation matrix for independent numeric variables"
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 0.256 0.672
## volatile.acidity 0.256 1.000 0.552
## citric.acid 0.672 0.552 1.000
## residual.sugar 0.115 0.002 0.144
## chlorides 0.094 0.061 0.204
## free.sulfur.dioxide 0.154 0.011 0.061
## total.sulfur.dioxide 0.113 0.076 0.036
## density 0.668 0.022 0.365
## pH 0.683 0.235 0.542
## sulphates 0.183 0.261 0.313
## alcohol 0.062 0.202 0.110
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.115 0.094 0.154
## volatile.acidity 0.002 0.061 0.011
## citric.acid 0.144 0.204 0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 0.022
## pH 0.086 0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 0.221 0.069
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.113 0.668 0.683 0.183 0.062
## volatile.acidity 0.076 0.022 0.235 0.261 0.202
## citric.acid 0.036 0.365 0.542 0.313 0.110
## residual.sugar 0.203 0.355 0.086 0.006 0.042
## chlorides 0.047 0.201 0.265 0.371 0.221
## free.sulfur.dioxide 0.668 0.022 0.070 0.052 0.069
## total.sulfur.dioxide 1.000 0.071 0.066 0.043 0.206
## density 0.071 1.000 0.342 0.149 0.496
## pH 0.066 0.342 1.000 0.197 0.206
## sulphates 0.043 0.149 0.197 1.000 0.094
## alcohol 0.206 0.496 0.206 0.094 1.000
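A correlation matrix like the one above can be computed with `cor()`. The sketch below uses four columns of the first ten rows from the `str()` output, so its values will not match the full-data matrix. None of the entries printed above is negative, although the later text describes negative relationships (alcohol and density, for example), so the report may show magnitudes only or the signs may have been lost. That is my reading rather than something the source states, so the sketch prints both versions.

```r
# First ten rows of four columns, copied from the str() output above
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation between every pair of columns
cm <- cor(wine)

print("Correlation matrix for independent numeric variables")
print(round(cm, 3))       # signed coefficients

# Magnitudes only (assumption: this is what the report displays)
print(round(abs(cm), 3))
```

Keeping the signed matrix alongside the magnitudes makes it easier to tell which strong pairs move together and which move in opposite directions.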
According to the boxplot, we can notice that quality has a clear negative relationship with volatile.acidity and a clear positive relationship
Based on the correlation matrix for the independent numeric variables, I notice that the pairs of variables with strong correlation coefficients measure closely related things (free.sulfur.dioxide & total.sulfur.dioxide: 0.668; fixed.acidity & citric.acid: 0.672).
The correlation coefficient between fixed.acidity & pH is 0.683, the largest value in the matrix for any pair of distinct variables.
Building on the bivariate analysis, I combine those key variables. At every quality level, wines with more total.sulfur.dioxide also have more free.sulfur.dioxide.
Interestingly, a stronger relationship between fixed.acidity and citric.acid goes with higher wine quality!
Alcohol content and density have a negative relationship at every quality level.
Density and fixed.acidity have a positive relationship at every quality level.
Univariate analysis lets us get acquainted with a new dataset, and bivariate analysis develops that picture by relating the variables to each other. Multivariate analysis may help us find deeper and more interesting information in a large amount of data. The Red Wine dataset is not huge; for this problem I applied what I learned from Udacity, and the dataset is just big enough for a beginner to practice on. I still need to study more (e.g. the extra materials given by the instructor) to improve. Machine learning algorithms may help us explore the dataset, find potential rules and predict the quality of new observations. So after I pick up some basic machine learning methods from the next topic in Udacity, I will come back and apply what I learned in both R and Python!